Principal-Agent Bandit Games with Self-Interested and Exploratory Learning Agents
Liu, Junyan; Ratliff, Lillian J.
We study the repeated principal-agent bandit game, where the principal interacts with the unknown environment only indirectly, by proposing incentives for the agent to play arms. Most existing work assumes the agent has full knowledge of the reward means and always behaves greedily, but in many online marketplaces the agent must learn the unknown environment and sometimes explore. Motivated by such settings, we model a self-interested learning agent with exploration behaviors, who iteratively updates reward estimates and either selects an arm that maximizes the estimated reward plus incentive or explores arbitrarily with a certain probability. As a warm-up, we first consider a self-interested learning agent without exploration. We propose algorithms for both the i.i.d. and linear reward settings with bandit feedback over a finite horizon $T$, achieving regret bounds of $\widetilde{O}(\sqrt{T})$ and $\widetilde{O}(T^{2/3})$, respectively. These algorithms are built upon a novel elimination framework coupled with newly developed search algorithms that accommodate the uncertainty arising from the agent's learning behavior. We then extend the framework to handle the exploratory learning agent and develop an algorithm that achieves an $\widetilde{O}(T^{2/3})$ regret bound in the i.i.d. reward setting by making our elimination framework robust to the agent's potential exploration. Finally, when our agent model reduces to the one studied by Dogan et al. (2023a), we propose an algorithm based on our robust framework that achieves an $\widetilde{O}(\sqrt{T})$ regret bound, significantly improving upon their $\widetilde{O}(T^{11/12})$ bound.
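The agent model in this abstract is concrete enough to sketch. The following is a minimal, hypothetical rendering of an exploratory learning agent (class and parameter names are illustrative, not from the paper): it keeps running reward estimates and, given the principal's incentives, either acts greedily on estimate-plus-incentive or explores with some fixed probability.

```python
import numpy as np

rng = np.random.default_rng(0)

class ExploratoryLearningAgent:
    """Sketch of the abstract's agent model (details are assumptions):
    the agent maintains reward estimates and, offered incentives by the
    principal, either maximizes (estimate + incentive) or explores
    arbitrarily with probability `explore_prob`."""

    def __init__(self, n_arms, explore_prob=0.1):
        self.explore_prob = explore_prob
        self.counts = np.zeros(n_arms)
        self.estimates = np.zeros(n_arms)

    def choose(self, incentives):
        if rng.random() < self.explore_prob:
            # arbitrary exploration: here, a uniformly random arm
            return int(rng.integers(len(self.estimates)))
        # self-interested greedy choice on estimated reward plus incentive
        return int(np.argmax(self.estimates + np.asarray(incentives)))

    def update(self, arm, reward):
        # incremental running-mean update of the reward estimate
        self.counts[arm] += 1
        self.estimates[arm] += (reward - self.estimates[arm]) / self.counts[arm]
```

The principal's elimination framework then has to reason about which arm such an agent will pick under each incentive vector, despite not observing the agent's internal estimates.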
An Adaptive Approach for Infinitely Many-armed Bandits under Generalized Rotting Constraints
Kim, Jung-hun; Vojnovic, Milan; Yun, Se-Young
In this study, we consider infinitely many-armed bandit problems in a rested rotting setting, where the mean reward of an arm may decrease with each pull and otherwise remains unchanged. We explore two scenarios for the rotting of rewards: one in which the cumulative amount of rotting is bounded by $V_T$, referred to as the slow-rotting case, and one in which the cumulative number of rotting instances is bounded by $S_T$, referred to as the abrupt-rotting case. To address the challenge posed by rotting rewards, we introduce an algorithm that uses UCB with an adaptive sliding window, designed to manage the bias-variance trade-off arising from rotting rewards. Our proposed algorithm achieves tight regret bounds for both the slow- and abrupt-rotting scenarios. Lastly, we demonstrate the performance of our algorithm in numerical experiments.
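The core device named here, a UCB index over a sliding window, can be illustrated in a few lines. The sketch below is an assumption-laden simplification: the paper's algorithm adapts the window size to trade bias (stale samples under rotting) against variance (few samples), whereas this version takes the window as a fixed argument.

```python
import numpy as np

def sliding_window_ucb(recent_rewards, t, window, confidence=2.0):
    """UCB index computed over only the last `window` pulls of an arm,
    a standard device for rewards that drift over time. The adaptive
    window-size rule from the paper is not reproduced here."""
    recent = recent_rewards[-window:]
    if len(recent) == 0:
        return float("inf")  # unplayed arms are tried first
    mean = np.mean(recent)
    bonus = np.sqrt(confidence * np.log(t) / len(recent))
    return mean + bonus
```

A shorter window tracks a rotting mean more faithfully but widens the confidence bonus; the adaptive choice of `window` is where the paper's bias-variance management happens.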
Bandit Social Learning: Exploration under Myopic Behavior
Banihashem, Kiarash; Hajiaghayi, MohammadTaghi; Shin, Suho; Slivkins, Aleksandrs
Reviews and ratings are pervasive in many online platforms. A customer consults reviews/ratings, chooses a product, and then (often) leaves feedback, which the platform aggregates and serves to future customers. Collectively, customers face a tradeoff between exploration and exploitation, i.e., between acquiring new information while making potentially suboptimal decisions and making optimal decisions using the available information. However, individual customers tend to act myopically and favor exploitation, without regard to exploration for the sake of others. At a high level, we ask whether and how this myopic behavior interferes with efficient exploration. We are particularly interested in learning failures, in which only a few agents choose an optimal action.
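A toy simulation makes the failure mode tangible. The following sketch (all modeling choices are assumptions for illustration, not the paper's protocol) has each arriving agent myopically pick the product with the best aggregated sample mean and leave a Bernoulli rating; early bad luck on the better product can lock all subsequent agents onto the worse one.

```python
import numpy as np

rng = np.random.default_rng(1)

def myopic_social_learning(true_means, n_agents):
    """Toy model of the abstract's setting: agents see aggregated
    ratings (sample means), choose greedily, and leave feedback.
    Returns how many agents picked the truly optimal product."""
    k = len(true_means)
    counts = np.ones(k)                                  # one seed rating each
    sums = rng.binomial(1, true_means).astype(float)
    best = int(np.argmax(true_means))
    optimal_picks = 0
    for _ in range(n_agents):
        arm = int(np.argmax(sums / counts))              # purely myopic choice
        reward = rng.binomial(1, true_means[arm])
        sums[arm] += reward
        counts[arm] += 1
        optimal_picks += (arm == best)
    return optimal_picks

print(myopic_social_learning(np.array([0.7, 0.5]), 1000))
```

Running this repeatedly shows a non-trivial fraction of runs in which almost no agent plays the optimal arm, the kind of learning failure the paper studies.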
Rotting infinitely many-armed bandits
Kim, Jung-hun; Vojnovic, Milan; Yun, Se-Young
We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T,\sqrt{T}\})$ worst-case regret lower bound, where $T$ is the time horizon. We show that, when the algorithm knows the maximum rotting rate $\varrho$, a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T,\sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm together with a threshold value to decide whether to continue pulling an arm or remove it from further consideration. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T,T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, using an adaptive UCB index along with an adaptive threshold value.
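The known-rate algorithm's shape, pull the current arm while its UCB index clears a threshold, otherwise discard it and draw a fresh arm from the infinite reservoir, can be sketched directly. Everything below (the index, the fixed threshold, the toy rotting dynamics) is a simplified assumption; the paper's index and threshold depend on $\varrho$ in a more refined way.

```python
import numpy as np

rng = np.random.default_rng(2)

def ucb_threshold_rotting(sample_new_arm, pull, T, threshold, confidence=2.0):
    """Sketch of the UCB-plus-threshold scheme: keep pulling the current
    arm while its UCB index stays above `threshold`; otherwise remove it
    and sample a new arm from the infinite reservoir."""
    total = 0.0
    arm, rewards = sample_new_arm(), []
    for _ in range(T):
        r = pull(arm)
        rewards.append(r)
        total += r
        ucb = np.mean(rewards) + np.sqrt(confidence * np.log(T) / len(rewards))
        if ucb < threshold:
            arm, rewards = sample_new_arm(), []  # discard arm, try a fresh one
    return total

# Toy usage with hypothetical dynamics: arms start at a random mean and
# rot by 0.001 per pull (rested rotting).
means = {}
def sample_new_arm():
    arm = len(means)
    means[arm] = rng.uniform(0.3, 0.9)
    return arm
def pull(arm):
    means[arm] -= 0.001                      # mean decreases on each pull
    return means[arm] + rng.normal(0, 0.1)   # noisy observed reward
print(ucb_threshold_rotting(sample_new_arm, pull, T=2000, threshold=0.5))
```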
Learning in Restless Multi-Armed Bandits via Adaptive Arm Sequencing Rules
We consider a class of restless multi-armed bandit (RMAB) problems with unknown arm dynamics. At each time, a player chooses one of N arms to play, referred to as the active arm, and receives a random reward from a finite set of reward states. The reward state of the active arm transitions according to unknown Markovian dynamics, while the reward state of each passive arm (not chosen to play at time t) evolves according to an arbitrary unknown random process. The objective is to find an arm-selection policy that minimizes the regret, defined as the reward loss relative to a player that always plays the most rewarding arm. This class of RMAB problems has recently been studied in the context of communication networks and financial investment applications. We develop a strategy that selects arms to be played in a consecutive manner, dubbed the Adaptive Sequencing Rules (ASR) algorithm. The sequencing rules for selecting arms under the ASR algorithm are adaptively updated and controlled by the current sample reward means. By judiciously designing the adaptive sequencing rules, we show that the ASR algorithm achieves a logarithmic regret order in time, and we establish a finite-sample bound on the regret. Although existing methods attain a logarithmic regret order in time in this RMAB setting, our theoretical analysis shows a significant improvement in how the regret scales with the system parameters under ASR. Extensive simulation results support the theoretical study and demonstrate strong performance of the algorithm compared to existing methods.
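To give a flavor of "consecutive play ordered by sample means", here is a very loose illustration; it is not the ASR rules themselves (the paper's block structure and estimators are substantially more involved), and the block-length heuristic below is purely an assumption.

```python
import numpy as np

def sequencing_round(sample_means, epoch_len):
    """Loose illustration of adaptive sequencing: in each round, arms
    are scheduled for consecutive blocks of plays in an order driven by
    the current sample reward means (best-looking first), so promising
    arms have their estimates refreshed most often."""
    order = np.argsort(-np.asarray(sample_means))
    schedule = []
    for rank, arm in enumerate(order):
        block = max(epoch_len // (rank + 1), 1)  # longer runs for better arms
        schedule.extend([int(arm)] * block)
    return schedule

# e.g. with current sample means [0.2, 0.8, 0.5] and epoch_len=4:
print(sequencing_round([0.2, 0.8, 0.5], 4))  # arm 1 played longest, then 2, then 0
```

Playing arms in consecutive blocks matters in the restless setting because reward-state estimates for Markovian arms are only meaningful over contiguous runs of pulls.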